Similarity-preserving Metrics for Amino-acid Sequences
Abstract
Sequence alignments, and the similarity scores derived from them, are the most common tools for comparing amino acid and DNA sequences. Different scoring schemes, from the simple +6/-1 scheme to the PAM and BLOSUM scoring matrices, have been devised to highlight particular biological or evolutionary properties of the sequences being compared. However, most methods of classical as well as non-parametric statistics, including a number of neural network approaches, rely on a different way of comparing data: the distance measure. It would therefore be advantageous if we could derive a distance between two sequences from their similarity score. Intuitively, it is clear that these two measures are related: the higher the similarity between sequences, the lower the distance between them should be. But unlike a similarity score, which can be defined in a fairly arbitrary (although not always meaningful) manner, a distance measure must meet three elementary requirements. A distance measure on a set D is a function d : D × D → R that is zero if and only if its two arguments are equal, is symmetric, and satisfies the triangle inequality, d(x, z) ≤ d(x, y) + d(y, z). As a consequence, such a function is always non-negative. Distances are most commonly defined on vector spaces, but are not limited to them. Any set on which a function with the above properties is defined is called a metric space.
A number of metrics for strings have been proposed, although many of them are not true metrics because they fail one or more of these requirements. A simple and computationally very efficient "distance" measure for sequences is the feature distance [1]. A feature is a short substring, usually referred to as an N-gram, N being its length. The feature distance is then computed as the number of features in which the two strings differ. Note that this measure is not a true distance, because two different strings can be at zero distance from each other.
For example, the strings AABA and ABAA contain the same bigrams, so with N = 2 the "distance" between them is zero. Another very common distance measure for strings is the Levenshtein distance [2]. It measures the minimum effort needed to transform one string into another using basic edit operations: replacement, insertion, and deletion of a symbol. Generally, each of these operations is assigned a cost, in which case the distance function is usually referred to as the weighted Levenshtein distance. …
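The two string measures described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the function names, parameter names, and default costs are hypothetical, the feature distance is assumed to compare N-grams as multisets, and the weighted Levenshtein distance is assumed to use one uniform cost per operation type.

```python
from collections import Counter


def ngrams(s, n):
    """Return the multiset of all length-n substrings (features) of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def feature_distance(a, b, n=2):
    """Count the n-gram occurrences in which a and b differ.

    Assumption: features are compared as multisets, so a repeated
    n-gram counts once per occurrence.
    """
    ca, cb = ngrams(a, n), ngrams(b, n)
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def weighted_levenshtein(a, b, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Minimum total cost of edits turning a into b (dynamic programming).

    Assumption: one uniform cost per operation type; the parameter
    names and defaults are illustrative only.
    """
    # prev[j] is the cost of building b[:j] from the empty prefix of a.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * del_cost]  # cost of deleting all of a[:i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                           # delete x
                curr[j - 1] + ins_cost,                       # insert y
                prev[j - 1] + (0.0 if x == y else sub_cost),  # keep or replace
            ))
        prev = curr
    return prev[-1]


print(feature_distance("AABA", "ABAA"))      # 0
print(weighted_levenshtein("AABA", "ABAA"))  # 2.0
```

The bigram feature distance between AABA and ABAA is zero even though the strings differ, whereas the unit-cost Levenshtein distance separates them. In this sketch the weighted variant is symmetric only when `ins_cost` equals `del_cost`.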
Similar resources
Phylogenetic and sequence analysis of the growth hormone gene of two sturgeons, Huso huso and Acipenser Gueldenstaedtii
In this study, the cDNA Growth Hormone (cGH) of the Beluga sturgeon (Huso huso) and Russian sturgeon (Acipenser gueldenstaedtii) were cloned and sequenced, and phylogenetic relationships were examined using nucleic acid and amino acid sequences. The nucleotide sequence of the Beluga GH has an open reading frame of 645 nucleotides encoding a protein of 214 amino acid residues. The signal peptide cleav...
Identification and characterization of a NBS–LRR class resistance gene analog in Pistacia atlantica subsp. Kurdica
P. atlantica subsp. Kurdica, with the local name of Baneh, is a wild medicinal plant which grows in Kurdistan, Iran. The identification of resistance gene analogs holds great promise for the development of resistant cultivars. A PCR approach with degenerate primers designed according to conserved NBS-LRR (nucleotide binding site-leucine rich repeat) regions of known disease-resistance (R) gene...
Signal processing approaches as novel tools for the clustering of N-acetyl-β-D-glucosaminidases
Nowadays, the clustering of proteins, and of enzymes in particular, is one of the most popular topics in bioinformatics. An increasing number of chitinase genes from different organisms and their sequences have been identified. So far, various mathematical algorithms for the clustering of chitinase genes have been used, but most of them seem to be confusing and sometimes insufficient. In the...
Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment
MOTIVATION: Many proteins with vastly dissimilar sequences are found to share a common fold, as evidenced in the wealth of structures now available in the Protein Data Bank. One idea that has found success in various applications is the concept of a reduced amino acid alphabet, wherein similar amino acids are clustered together. Given the structural similarity exhibited by many apparently dissim...
Amino acid Substitution Mutations Analysis of gyrA and parC Genes in Clonal Lineage of Klebsiella pneumoniae conferring High-Level Quinolone Resistance
Background: The emergence of Klebsiella pneumoniae resistant to quinolone antibiotics due to mutations in the gyrA and parC genes has created a problem for the treatment of patients in different hospitals in Iran. The objective of this study was to determine the amino acid substitutions of GyrA and ParC proteins in certain clonal lineages of K. pneumoniae conferring high-level quinolone resistance. Methods: One...
Publication date: 2002